Estimand: a target quantity to be estimated
Estimator: a function \(W(\mathbf{x})\) that is a recipe for how to get an estimate from a sample
Estimate: a realized value of \(W(\mathbf{x})\) applied to an actual sample, \(\mathbf{x}\)
The goal of an estimator is to estimate the estimand well.
This is important because it’s what allows us to make inferences about a Population statistic based on a Sample statistic. This is the core of inference. Good estimators will be:
Unbiased
Consistent
Efficient
Intuitive Idea: our estimator, \(\hat{\theta}\), should not systematically mis-estimate \(\theta\)
from Scott Fortmann-Roe
Math Idea: an estimator \(\hat{\theta}\) is unbiased if
\[ \mathbb{E}(\hat{\theta}) = \theta \]
\[ \hat{\mu} = \frac{1}{n}\sum_{i=1}^{n}\mathbf{X}_i \]
\[ \mathbb{E}[\hat{\mu}] = \mathbb{E} \left [ \frac{1}{n}\sum_{i=1}^{n}\mathbf{X}_i\right] = \frac{1}{n}\sum_{i=1}^{n}\mathbb{E} \left [ \mathbf{X}_i\right] = \frac{1}{n} \cdot n\mu = \mu \]
Thus, the sample mean, \(\hat{\mu}\) is an unbiased estimator of the population mean \(\mu\)
Note we’re using \(\hat{}\) in this section (pronounced “hat”) for consistency. Common estimators like the sample mean often have their own symbols like \(\bar{x}\) which you’ll also see used. In general, putting a \(\hat{}\) on something means that we’re creating an estimate of it.
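We can sanity-check the unbiasedness claim with a quick simulation sketch in R (the \(\mathcal{N}(5, 2^2)\) population here is made up for illustration):

```r
# simulate many samples from a known population and check that the
# sample means average out to the true mean (mu = 5, an assumed value)
set.seed(1)
mu <- 5
sample_means <- replicate(10000, mean(rnorm(n = 10, mean = mu, sd = 2)))
mean(sample_means) # very close to mu = 5
```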
\[ \hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^n \left( \mathbf{X}_i - \hat{\mu}\right)^2 \]
\[ \mathbb{E}[\hat{\sigma}^2] = \mathbb{E}\left[ \frac{1}{n} \sum_{i=1}^n \left( \mathbf{X}_i - \hat{\mu}\right)^2 \right] = \frac{n-1}{n}\sigma^2 \]
Thus, \(\hat{\sigma}^2\) (the sample variance with divisor \(n\)) is a biased estimator of the population variance \(\sigma^2\)
Note: this is why, when calculating the sample variance, we divide by \(n-1\) instead of \(n\). Intuitively this makes sense: we first estimate the sample mean \(\hat{\mu}\), losing 1 degree of freedom, and then use that estimate to estimate \(\hat{\sigma}^2\)…
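A quick simulation sketch shows both the bias and the \(n-1\) fix (the normal population with \(\sigma^2 = 4\) is an assumed value):

```r
# divisor-n variance systematically underestimates sigma^2 = 4;
# R's var() divides by n - 1 and does not
set.seed(1)
n <- 10
biased   <- replicate(10000, { x <- rnorm(n, 0, 2); mean((x - mean(x))^2) })
unbiased <- replicate(10000, var(rnorm(n, 0, 2)))
mean(biased)   # close to (n-1)/n * 4 = 3.6
mean(unbiased) # close to 4
```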
Intuitive Idea: as we collect more data (information) the estimator should approximate the estimand more closely.
If we could have \(\infty\) information, our estimator should spit out estimates equal to the estimand.
Math Idea:
\[ \lim_{n \to \infty} \hat{\theta}_n = \theta \]
(more precisely: \(\hat{\theta}_n\) converges in probability to \(\theta\))
Example:
Sample Mean: The Law of Large Numbers guarantees that:
\[ \lim_{n \to \infty} \hat{\mu} = \mu \]
Intuitive Idea: as you get more and more independent, random samples of \(\mathbf{X}\), the sample mean of \(\mathbf{X}\) will get closer and closer (and eventually converge) to its expected value.
Math Idea: for all \(\epsilon > 0\), if \(\sigma^2 < \infty\)
\[ \lim_{n \to \infty} P(|\bar{X}_n - \mu| < \epsilon) = 1 \]
For a random variable \(X\) with finite variance \(\sigma^2\) and expected value \(\mu\)
\[ P(|\bar{X}_n - \mu| \geq \epsilon) \leq \frac{Var(\bar{X}_n)}{\epsilon^2} = \frac{\sigma^2}{n\epsilon^2} \]
As \(n \to \infty\), \(\frac{\sigma^2}{n\epsilon^2} \to 0\). So the probability that \(|\bar{X_n} - \mu| \geq \epsilon\) goes to \(0\). Thus, \(P(|\bar{X_n} - \mu| < \epsilon) \to 1\)
Note: we’re using Chebyshev’s Inequality here, which states that \(P(|X-\mu| \geq k) \leq \frac{\sigma^2}{k^2}\)
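One way to see the LLN at work is the running mean of simulated coin flips (a sketch; a fair coin is assumed):

```r
# the running sample mean of fair coin flips settles toward q = 0.5
set.seed(1)
flips <- sample(0:1, size = 10000, replace = TRUE)
running_mean <- cumsum(flips) / seq_along(flips)
running_mean[c(10, 100, 10000)] # estimates tighten around 0.5 as n grows
```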
Intuitive Idea: the estimate we get should have the smallest variance possible (so that we can be more confident about our estimate with as little data as possible)
Math Idea:
\[ Var(\hat{\theta}) \geq \frac{1}{I(\theta)} \]
where \(I(\theta)\) is the Fisher Information for \(\theta\). This is the Cramér–Rao lower bound: an unbiased estimator is efficient when its variance achieves it.
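To make "efficient" concrete, here is a sketch comparing two unbiased estimators of a normal mean; for normal data the sample mean has smaller variance than the sample median:

```r
# both estimators are unbiased for mu = 0, but the mean's estimates
# are less spread out: the mean is the more efficient estimator here
set.seed(1)
means   <- replicate(10000, mean(rnorm(25)))
medians <- replicate(10000, median(rnorm(25)))
var(means)   # roughly 1/25 = 0.04
var(medians) # larger, roughly pi/(2 * 25)
```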
Intuitive Idea: the amount of information that a sample from a random variable \(\mathbf{X}\) can give us about a parameter \(\theta\)
Imagine that everyone in Room A has the same number of cats (\(\mu_A\))
Imagine that cat ownership in Room B is defined as \(Pois(\mu_B)\) where \(\mu_B\) is the mean number of cats owned in Room B
In which Room do I learn more about \(\mu\) by asking one person how many cats they own?
Intuitive Idea: the amount of information that a sample from a random variable \(\mathbf{X}\) can give us about a parameter \(\theta\)
Imagine that Room A has \(\text{height}_{cm} \sim \mathcal{N}(\mu_A, 8)\)
Imagine that Room B has \(\text{height}_{cm} \sim \mathcal{N}(\mu_B, 1)\)
In which Room do I learn more about \(\mu\) by asking one person their height?
Slightly Mathy Idea: Fisher Information measures how sensitive the log-likelihood function \(\ell(\theta | X)\) is to changes in \(\theta\) (more sensitive \(\to\) more information)
Math Idea:
\[ I_X(\theta) = -\mathbb{E}\left[\frac{\partial^2 \ell(\theta | X)}{\partial\theta^2} \right] \]
where \(\ell(\theta | X)\) is the log-likelihood of \(\theta\) given \(X\). If \(\ell\) is sensitive to changes in \(\theta\), the second derivative will be large in magnitude and we expect to see high information
Note: usually \(\ell\) is concave down around the maximum likelihood estimate, meaning the second derivative is negative there, hence the negative sign in front of the expectation
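A Monte Carlo sketch of this definition for a single Bernoulli(\(q\)) observation, where the analytic answer is \(I(q) = \frac{1}{q(1-q)}\) (the choice \(q = 0.3\) is arbitrary):

```r
# check I(q) = -E[d^2 l/dq^2] against the analytic 1 / (q(1-q))
set.seed(1)
q <- 0.3
x <- rbinom(100000, size = 1, prob = q)
# l(q | x) = x*log(q) + (1-x)*log(1-q), so its second derivative is:
d2 <- -x / q^2 - (1 - x) / (1 - q)^2
mean(-d2)         # Monte Carlo estimate of I(q)
1 / (q * (1 - q)) # analytic value, ~4.76
```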
Estimators take in sample data and produce a sample estimate
Good estimators produce estimates that allow us to make inferences about population parameters
These estimates are (so far) individual numbers that are guesses for population parameters
Point Estimate: a single value calculated based on a sample that estimates a population parameter
Interval Estimate: a range of values calculated based on a sample that estimate a population parameter with uncertainty
E.g. the mean height of Michaels is 178cm vs. the mean height of Michaels is between 175-181cm
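A sketch of both kinds of estimate from one sample (the heights below are made up for illustration):

```r
# hypothetical heights in cm, invented for illustration
heights  <- c(176, 181, 174, 179, 183, 177, 180, 175)
point    <- mean(heights)                       # point estimate, 178.125
se       <- sd(heights) / sqrt(length(heights)) # standard error
interval <- point + c(-2, 2) * se               # rough 95% interval estimate
point
interval
```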
Think about the research or industry work you’ve done. When would interval estimates have been helpful?
My Story: the clients who didn’t report uncertainty…
In the last class, we talked about the first problem of inference: data is too complex to reason about, we need summaries. But now that we’ve exhaustively discussed the problem of point estimates, we run into the second problem of inference…
Uncertainty.
Claim: My mean crossword time is faster than yours.
Is this 👆 enough to convince you that my mean time is faster than yours? Why/Why not?
Pro: 🤷♀️ the sample mean is an unbiased estimate
First problem: we need some more data…
Now that we have more data, Is this 👆 enough to convince you that my mean time is faster than yours? Why/Why not?
Frequentism main Ideas:
data (\(X\)) is a random sample of our process \(P_{\theta}\), the parameters (\(\theta\)) are fixed
inference relies on the idea of repeated samples of \(X\) from \(P_{\theta}\)
probabilities are the long run frequency of an event
\[ p = \lim_{n \to \infty} \frac{k}{n} \]
Bayesianism main Ideas:
data \(X\) is fixed, and the parameters \(\theta\) of our process \(P_{\theta}\) are random
inference relies on the idea of updating prior beliefs based on evidence from the data
probabilities are used to quantify uncertainty we have about parameters
\[ \underbrace{p(\theta|d)}_\text{posterior} = \underbrace{\frac{p(d|\theta)}{p(d)}}_\text{update} \times \underbrace{p(\theta)}_\text{prior} \]
In the past month, it’s rained on 9 of the 30 days. 🌧️
Frequentist: the probability of rain is \(q = \frac{9}{30} = 0.3\)
Bayesian: before seeing the data, values of \(q\) near \(0.1\) sounded the most reasonable based on my knowledge of California. After seeing the data, I think the probability of rain \(q\) is most likely \(0.25\) but there’s a lot of uncertainty.
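One standard way to formalize that Bayesian update is a Beta prior on \(q\); the Beta(2, 10) prior below (mode 0.1) is an assumption for illustration:

```r
# Beta(2, 10) prior (mode 0.1, an assumed prior), updated with
# 9 rainy days out of 30 via the Beta-Binomial conjugate update
a_post <- 2 + 9            # prior a + rainy days
b_post <- 10 + 30 - 9      # prior b + dry days
a_post / (a_post + b_post) # posterior mean of q: 11/42, ~0.26
```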
In a frequentist analysis, uncertainty represents sampling variability: the difference between estimates on repeated, similar samples.
💡 if i took a bunch of random samples exactly like this one, how variable would my estimates be?
If I take repeated random samples of 2 people from my list of J’s, what are the different mean ages that I get?
(hint: sample(c(17,24,21,19,25,28,23,20,23,25), size = 2, replace = TRUE))
| idx | Name | Age |
|---|---|---|
| 1 | John | 17 |
| 2 | James | 24 |
| 3 | Jane | 21 |
| 4 | June | 19 |
| 5 | Joachim | 25 |
| 6 | Jess | 28 |
| 7 | Javier | 23 |
| 8 | Jaques | 20 |
| 9 | Julie | 23 |
| 10 | Jackson | 25 |
If I take repeated random samples of 8 people from my list of J’s, what are the different mean ages that I get?
(hint: sample(c(17,24,21,19,25,28,23,20,23,25), size = 8, replace = TRUE))
| idx | Name | Age |
|---|---|---|
| 1 | John | 17 |
| 2 | James | 24 |
| 3 | Jane | 21 |
| 4 | June | 19 |
| 5 | Joachim | 25 |
| 6 | Jess | 28 |
| 7 | Javier | 23 |
| 8 | Jaques | 20 |
| 9 | Julie | 23 |
| 10 | Jackson | 25 |
Remember, our sample mean \(\bar{x}\) is our estimate of our population mean \(\mu\). In which case are the sample means we get most certain?
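We can answer that by simulating both sampling schemes from the table of J's above:

```r
ages <- c(17, 24, 21, 19, 25, 28, 23, 20, 23, 25)
set.seed(1)
means_2 <- replicate(10000, mean(sample(ages, size = 2, replace = TRUE)))
means_8 <- replicate(10000, mean(sample(ages, size = 8, replace = TRUE)))
sd(means_2) # more spread out
sd(means_8) # smaller: tighter around the population mean of 22.5
```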
Let’s say income of Chapman workers is \(\text{income} \sim gamma(0.6, 100000)\)
❓if I sample 2 people, how likely is it that I’ll get a sample mean near $500,000
❓if I sample 2,000 people, how likely is it that I’ll get a sample mean near $500,000
The more data we have, the more certain we are about our estimates
The more data we have, the less likely we are to get all extreme values
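A simulation sketch of the income example, reading gamma(0.6, 100000) as shape 0.6 and scale 100000 (an assumption about the parameterization):

```r
# estimate P(sample mean >= $500,000) for n = 2 vs n = 2000
set.seed(1)
sim_means <- function(n, reps) {
  replicate(reps, mean(rgamma(n, shape = 0.6, scale = 1e5)))
}
p_2    <- mean(sim_means(2,    20000) >= 5e5) # tiny, but it can happen
p_2000 <- mean(sim_means(2000, 2000)  >= 5e5) # essentially zero
```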
Sampling Distribution: the theoretical distribution of all possible estimates that result from taking a sample of size \(n\) from \(P_{\theta}\)
e.g. Sampling Distribution of \(\bar{x}\), Sampling Distribution of \(\hat{\sigma}\), Sampling Distribution of \(\hat{q}\)
coin_flips <- sample(0:1, size = 100, replace = TRUE) # heads = 1
mean(coin_flips) # proportion of heads
[1] 0.53
Everyone run this code 10 times, and put the proportions of heads in this sheet.
How much uncertainty do we have about \(\hat{q}\)?
# simulated sampling dist
coin_flips <- replicate(10000, mean(sample(0:1, size = 100, replace = TRUE)))
# calculate 5th and 95th percentile
ci <- quantile(coin_flips, c(0.05,0.95))
# plot
ggplot(data = data.frame(x = coin_flips),
aes(x = x)) + geom_histogram(binwidth = 0.02, fill = "blue", color = "darkgray") +
xlim(c(0.2,0.8)) +
geom_segment(x = ci[[1]],
xend = ci[[2]],
y = -1,
yend = -1,
linewidth = 2) +
labs(x = expression(hat(q)),
y = "",
title = "Sampling Distribution of Sample Prop")

# simulated sampling dist
coin_flips <- replicate(10000, mean(sample(0:1, size = 100, replace = TRUE)))
# calculate mean
mu <- mean(coin_flips)
# plot
ggplot(data = data.frame(x = coin_flips),
aes(x = x)) + geom_histogram(binwidth = 0.02, fill = "blue", color = "darkgray") +
xlim(c(0.2,0.8)) +
geom_vline(xintercept = mu,
linewidth = 2) +
labs(x = expression(hat(q)),
y = "",
title = "Sampling Distribution of Sample Prop")

So far, we used Monte Carlo simulations to approximate the sampling distribution. But often we can calculate it directly instead.
Claim: Sampling Distributions of sample means will (often) be a Normal Distribution, and we can use what we know about Normal Distributions to calculate our point estimate (best guess) and our interval estimates (uncertainty).
For a random variable \(X\) with finite variance \(\sigma^2\) and expected value \(\mu\)
\[ P(|\bar{X}_n - \mu| \geq \epsilon) \leq \frac{Var(\bar{X}_n)}{\epsilon^2} = \frac{\sigma^2}{n\epsilon^2} \]
As \(n \to \infty\), \(\frac{\sigma^2}{n\epsilon^2} \to 0\). So the probability that \(|\bar{X_n} - \mu| \geq \epsilon\) goes to \(0\). Thus, \(P(|\bar{X_n} - \mu| < \epsilon) \to 1\)
💡 In other words, \(\bar{X}_n \to \mu\), as \(n \to \infty\). The larger our sample, the more concentrated the sampling distribution will be around \(\mu\)
Note: this is the Central Limit Theorem:
Let \(\mathbf{X}\) be a random variable with finite variance \(\sigma^2\). As \(n \to \infty\), the distribution of sample means will be distributed as a normal distribution:
\[ \bar{x} \sim \mathcal{N}\left( \mu, \frac{\sigma^2}{n}\right) \]
Why is this useful?
we know a lot about the normal distribution, no more simulating to find out the point/interval estimate, we can calculate it directly!
even if the population data is not normally-distributed, the sampling distribution of \(\bar{x}\) will be!
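A sketch of that second point: even with a badly right-skewed Exponential(1) population (assumed here for illustration), the sampling distribution of \(\bar{x}\) behaves as the CLT predicts:

```r
# exponential population: mean 1, variance 1, very right-skewed
set.seed(1)
xbars <- replicate(10000, mean(rexp(50, rate = 1)))
mean(xbars) # ~ 1 (the population mean)
sd(xbars)   # ~ sqrt(1/50) = 0.141, as the CLT predicts
```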
\[ \hat{q} \sim \mathcal{N}\left( q, \frac{q(1-q)}{n}\right) \]
where \(q\) and \(q(1-q)\) are the mean and variance of a Bernoulli variable; in practice, we plug in our estimate \(\hat{q}\) for \(q\)
Now it’s easy to calculate the point estimate, and any interval estimate we want!
q <- mean(coin_flips)
n <- 100 # sample_size
# middle 90% of estimates
qnorm(c(0.05,0.95), mean = q,
sd = sqrt((q*(1-q))/n))
[1] 0.4169164 0.5814016